{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "# CatBoost JSON model tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/catboost/tutorials/blob/master/model_analysis/model_export_as_json_tutorial.ipynb)\n",
    "\n",
    "CatBoost supports exporting model to JSON format and loading model from it.\n",
    "\n",
    "This tutorial explains the structure of the JSON model with numeric features only."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download MSRank dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import catboost\n",
    "from catboost import datasets\n",
    "import os\n",
    "import numpy as np\n",
    "\n",
    "train_df, _ = datasets.msrank_10k()\n",
    "X, Y = train_df[train_df.columns[1:]], train_df[train_df.columns[0]]\n",
    "pool = catboost.Pool(\n",
    "    data=X[:1000], # top 1000 documents are enough for this example\n",
    "    label=Y[:1000],\n",
    "    feature_names=list(X.columns)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will train a simple model with trees of depth 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "cls = catboost.CatBoostClassifier(depth=2, random_seed=0, iterations=10, verbose=False)\n",
    "cls.fit(pool)\n",
    "approx = cls.predict(X[0:3], prediction_type=\"RawFormulaVal\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next block save JSON model to file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "cls.save_model(\n",
    "    \"model.json\",\n",
    "    format=\"json\",\n",
    "    # pool=pool  # this parameter is required only for models with categorical features.\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next block loads model from file as JSON and shows its keys.\n",
    "The model json contains __model_info__, __oblivious_trees__ and __features_info__. Model with categorical features will also conatain __ctrs__."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'model_info', u'oblivious_trees', u'features_info']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import json\n",
    "model = json.load(open(\"model.json\", \"r\"))\n",
    "model.keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model info\n",
    "model['model_info'] is analogue for [get_metadata()](https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboost_metadata-docpage/) function.\n",
    "You can look on the training parameters the model was trained with or on catboost version that the model was trained with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Svn info:\n",
      "    URL: svn+ssh://arcadia.yandex.ru/arc/trunk/arcadia\n",
      "    Last Changed Rev: 5437739\n",
      "    Last Changed Author: dkvasov\n",
      "    Last Changed Date: 2019-08-08T13:43:02.471284Z\n",
      "\n",
      "Other info:\n",
      "    Build by: eermishkina\n",
      "    Top src dir: /place/home/eermishkina/trunc/arcadia\n",
      "    Top build dir: /home/eermishkina/.ya/build\n",
      "    Hostname: su57.search.yandex.net\n",
      "    Host information: \n",
      "        Linux su57.search.yandex.net 4.4.88-42 #1 SMP Mon Sep 18 14:33:37 UTC 2017 x86_64\n",
      "\n",
      "    \n"
     ]
    }
   ],
   "source": [
    "print(model['model_info']['catboost_version_info'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Features info"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look now on the features_info value.\n",
    "It could contain $float\\_features$, $categorical\\_features$ and $ctrs$, which are lists of descriptions of some features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'float_features']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model['features_info'].keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In our case (model without cateforical features) there are only one fields - $float\\_features$.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='features_info'></a>\n",
    "Every float feature is described in the following way:\n",
    "__flat_feature_index__ (int) - feature index in pool, zero-based indexation \n",
    "\n",
    "__feature_index__ (int) index among only float features, zero-based indexation. For example, in dataset that looks like \\[float, categ, float\\], the second float feature has indices $flat\\_feature\\_index = 2$ and $feature\\_index = 1$. \n",
    "Because it is the 2 feature of all in 0 based indexation and 1 feature of numeric ones (here we exclude the caterorical feature).\n",
    "For a model without categorical features $feature\\_index$ will be equal to $float\\_feature\\_index$ for every feature.\n",
    "\n",
    "__borders__ (list of all borders (or splits) used in the model for this particular float feature). Float feature values can be splits by $border$ value. All elements with feature value $> border$ go to the left subtree, and all elements with feature value ($<= border$) elements go to the right subtree.\n",
    "\n",
    "__has_nans__ (bool) This field shows if there were any nan values in the training dataset, which was used to train the model.\n",
    "\n",
    "__nan_value_treatment__ ('AsIs', 'AsTrue' or 'AsFalse') If the feature has had nan values in the training dataset, then there is an additional split that puts nans to the left and everything else to the right (if 'AsFalse') or an additional split that puts nans to the right and everything else to the left (if 'AsTrue').\n",
    "'AsIs' is internal default value for features without nan values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{u'borders': [83.5],\n",
       " u'feature_index': 0,\n",
       " u'flat_feature_index': 0,\n",
       " u'has_nans': False,\n",
       " u'nan_value_treatment': u'AsIs'}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model['features_info']['float_features'][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This shows that the first feature has a list of borders that were used in the model. And this feature had no nan values in train."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Symmetric trees"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "CatBoost uses so-called symmetric or oblivious trees. For each level of the tree CatBoost uses the same features to split learning instances into the left and the right partitions: on the first level tree is partitioned by first split into two parts, on the second level each subtree splits with second split and so on.\n",
    "\n",
    "In this case a tree of depth $k$ has exactly $2^k$ leaves and $k$  splits, each split on a subsequent layer.\n",
    "\n",
    "There are three types of splits: \"FloatFeature\", \"OneHotFeature\" and \"OnlineCtr\". A model without cateforical features contains only float feature slits."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's look on how JSON model describes a single tree.\n",
    "A tree of depth $k$ is described by $2^k$ leaf_values, $2^k$ leaf_weights and $k$ splits. Let's take a look on the first tree of our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"leaf_values\": [\n",
      "    0.022173912547853145, \n",
      "    0.017826086558078085, \n",
      "    0.011304347573415556, \n",
      "    -0.02565217333967221, \n",
      "    -0.02565217333967221, \n",
      "    0.07652173742004059, \n",
      "    -0.016956521360122625, \n",
      "    -0.019130434355010134, \n",
      "    -0.02130434734989765, \n",
      "    -0.019130434355010148, \n",
      "    0.02407969585645712, \n",
      "    0.027495255552409406, \n",
      "    0.0035863376807432745, \n",
      "    -0.026869069608165402, \n",
      "    -0.028292219481478875, \n",
      "    0.07398772840759553, \n",
      "    -0.004233128739738201, \n",
      "    -0.012975459832675437, \n",
      "    -0.0286196312621424, \n",
      "    -0.02815950857304045\n",
      "  ], \n",
      "  \"splits\": [\n",
      "    {\n",
      "      \"split_index\": 4, \n",
      "      \"float_feature_index\": 16, \n",
      "      \"border\": 10.407758712768555, \n",
      "      \"split_type\": \"FloatFeature\"\n",
      "    }, \n",
      "    {\n",
      "      \"split_index\": 3, \n",
      "      \"float_feature_index\": 13, \n",
      "      \"border\": 3.5, \n",
      "      \"split_type\": \"FloatFeature\"\n",
      "    }\n",
      "  ], \n",
      "  \"leaf_weights\": [\n",
      "    123, \n",
      "    54, \n",
      "    512, \n",
      "    311\n",
      "  ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "def dump_json(item):\n",
    "    print(json.dumps(item, indent=2))\n",
    "\n",
    "dump_json(model['oblivious_trees'][0])  # first_tree"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The list of \"leaf_values\" describes the values in leaves. This is a tree with 4 leaves. It has depth 2, which means it has two different splits. The first split is used to split all the objects into left and right.\n",
    "And the second split is used twice, to split the left objects into two parts, and to split the right objects into two parts.\n",
    "The indices in the list can be represented using base-2 numeral system in the following way: 00, 01, 10, 11. Leaf 00 is the leaf where the 0-th split and the 1-st split are equal to False. Leaf 01 contains the objects where 0-th split is equal to False and 1-st feature is equal to True. And so on.\n",
    "\n",
    "The next part of the tree description is called \"leaf_weights\". This list represents sum of weights of training samples, that are in this leaf. Leaf indexation in this list is the same as in \"leaf_values\".\n",
    "\n",
    "The last part is \"splits\", and it is description of the two splits that are used in the tree of depth two.\n",
    "Each of the descriptions contains several key-values. Firstly, it contains internal CatBoost parameter $split\\_index$. This is the only parameter that is used by catboost when loading the model, all other parts of in \"splits\" are ignored (they are duplicated in a different place), and are present here only to do the model easier to understand."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Let's first describe the other fields. Split type \"FloatFeature\" means that it is so called 'float split'. Float split condition $float\\_feature[float\\_feature\\_index] > border$ (see above in [Features info](#features_info)) is decripted with \"float_feature_index\" and \"border\" accordingly.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This description should enable you to analyze the model.\n",
    "But if you will want to change the model, you will have to change \"split_index\" in a right way.\n",
    "To to do that let's explain, how this feature is built. Look one more time at features info:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{u'borders': [83.5],\n",
       " u'feature_index': 0,\n",
       " u'flat_feature_index': 0,\n",
       " u'has_nans': False,\n",
       " u'nan_value_treatment': u'AsIs'}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "feature = model['features_info']['float_features'][0]\n",
    "feature"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each float split determines by feature index and border value, hence feature description specifies len(feature['borders']) splits. List all float splits with first feature from model['features_info'] with border value from features $borders$ list, with second feature and so on. This is the order, in which splits are enumerated in  model. Splits numbering begins with 0. Build split list in this order\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "split_list = []\n",
    "for float_feature in model['features_info']['float_features']:\n",
    "    if not float_feature['borders']:\n",
    "        continue\n",
    "    for border in float_feature['borders']:\n",
    "        split_list.append(\n",
    "            {\n",
    "                'split_index': len(split_list),\n",
    "                'float_feature_index': float_feature['feature_index'], \n",
    "                'border_id': border, \n",
    "                'split_type': 'FloatFeature',\n",
    "                'flat_feature_index': float_feature['flat_feature_index']\n",
    "            }\n",
    "        )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ensure, that splits in first tree and corresponding splits from obtained above split_list are identical"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{u'border': 10.407758712768555,\n",
       "  u'float_feature_index': 16,\n",
       "  u'split_index': 4,\n",
       "  u'split_type': u'FloatFeature'},\n",
       " {u'border': 3.5,\n",
       "  u'float_feature_index': 13,\n",
       "  u'split_index': 3,\n",
       "  u'split_type': u'FloatFeature'}]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "first_tree = model['oblivious_trees'][0]\n",
    "first_tree['splits']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'border_id': 10.407758712768555,\n",
       "  'flat_feature_index': 16,\n",
       "  'float_feature_index': 16,\n",
       "  'split_index': 4,\n",
       "  'split_type': 'FloatFeature'},\n",
       " {'border_id': 3.5,\n",
       "  'flat_feature_index': 13,\n",
       "  'float_feature_index': 13,\n",
       "  'split_index': 3,\n",
       "  'split_type': 'FloatFeature'}]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "split_indexes = [x['split_index'] for x in first_tree['splits']]\n",
    "[split_list[index] for index in split_indexes]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multiclassification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The only difference between classification or regression vs multiclassification is the leaves count in each tree. Model contains leaf values for each class, so tree depth of $k$ has $2^k$ leafs and in json model are stored  $2^k \\cdot classes\\_count$ leaf values in this order: first $2^k$ values for first class, second  $2^k$ values for second and so on. Leaf weights count is $2^k$ as they are the same for all classes. Look at first tree of multiclass model trained on  [Iris](https://en.wikipedia.org/wiki/Iris_flower_data_set) dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{u'leaf_values': [0.05084745649058943,\n",
       "  -0.025423728245294732,\n",
       "  -0.025423728245294732,\n",
       "  -0.02526315733006127,\n",
       "  0.04578947266073611,\n",
       "  -0.020526315330674765,\n",
       "  0,\n",
       "  0,\n",
       "  0,\n",
       "  -0.025573769920184966,\n",
       "  -0.018196720904746992,\n",
       "  0.04377049082493207],\n",
       " u'leaf_weights': [50, 48, 0, 52],\n",
       " u'splits': [{u'border': 0.800000011920929,\n",
       "   u'float_feature_index': 3,\n",
       "   u'split_index': 58,\n",
       "   u'split_type': u'FloatFeature'},\n",
       "  {u'border': 1.5499999523162842,\n",
       "   u'float_feature_index': 3,\n",
       "   u'split_index': 64,\n",
       "   u'split_type': u'FloatFeature'}]}"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get Iris dataset \n",
    "from sklearn import datasets\n",
    "iris = datasets.load_iris()\n",
    "\n",
    "# Train the model\n",
    "cls_multilclass = catboost.CatBoostClassifier(loss_function='MultiClass', depth=2, random_seed=0, verbose=False)\n",
    "cls_multilclass.fit(iris.data, iris.target)\n",
    "\n",
    "# Save model\n",
    "cls_multilclass.save_model(\n",
    "    \"multiclass_model.json\",\n",
    "    format=\"json\",\n",
    "    # pool=pool  # is required for model with cat_features to obtain applicable model\n",
    ")\n",
    "\n",
    "multilclass_model = json.load(open(\"multiclass_model.json\", \"r\"))\n",
    "multilclass_model['oblivious_trees'][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Truncate model\n",
    "Model can be modified and applied. Truncate and apply model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "trees = model['oblivious_trees'][:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.46872039,  0.02717889, -0.02942634, -0.23159877, -0.23487417],\n",
       "       [ 0.16342239,  0.23999561,  0.08782089, -0.24248687, -0.24875202],\n",
       "       [ 0.33317769,  0.15813378, -0.0085402 , -0.24050062, -0.24227064]])"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "approx # = cls.predict(X[0:3], prediction_type=\"RawFormulaVal\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.27934996,  0.00999965, -0.04288901, -0.12313168, -0.12332892],\n",
       "       [ 0.08037036,  0.13138144,  0.04135186, -0.124843  , -0.12826065],\n",
       "       [ 0.19021373,  0.07576716, -0.01227257, -0.12547529, -0.12823304]])"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model['oblivious_trees'] = trees[0:5]  # use only first 5 trees\n",
    "json.dump(model, open(\"head_model.json\", \"w\"))  # Save modified model\n",
    "cls.load_model(\"head_model.json\", \"json\")  # load model\n",
    "cls.predict(X[0:3], prediction_type=\"RawFormulaVal\")  # apply model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.27934996,  0.00999965, -0.04288901, -0.12313168, -0.12332892],\n",
       "       [ 0.08037036,  0.13138144,  0.04135186, -0.124843  , -0.12826065],\n",
       "       [ 0.19021373,  0.07576716, -0.01227257, -0.12547529, -0.12823304]])"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model['oblivious_trees'] = trees[5:]  # drop first 5 trees\n",
    "json.dump(model, open(\"tail_model.json\", \"w\"))  # Save modified model\n",
    "cls.load_model(\"tail_model.json\", \"json\")  # load model\n",
    "cls.predict(X[0:3], prediction_type=\"RawFormulaVal\")  # apply model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
